
The Lancet Digital Health

Elsevier BV

Preprints posted in the last 90 days, ranked by how well they match The Lancet Digital Health's content profile, based on 25 papers previously published here. The average preprint has a 0.04% match score for this journal, so anything above that is already an above-average fit.

1
Inpatient diagnostic odysseys in rare diseases: a nationwide audited Orphanet ICD-10 DRG/GRD-IR analysis in Chile, 2019-2024

Gomez-Vargas, G. A.; Repetto, G. M.; Bravo, L.; Castillo-Laborde, C.; Delgado, I.; Matute, I.

2026-05-03 health informatics 10.64898/2026.05.01.26352213 medRxiv
Top 0.1%
17.3%

Background: Rare diseases (RD; enfermedades poco frecuentes, EPoF, in Chilean policy terminology) collectively affect 3.5-5.9% of the population and are associated with long diagnostic trajectories. Chile lacks a reproducible national operational definition for identifying RD in administrative hospital data. Methods: We conducted a retrospective observational analysis of Chilean GRD-IR events (IR-29301 version) for 2019-2024 released through FONASA Datos Abiertos, covering hospital discharges and major ambulatory surgery reported by 72 public establishments for FONASA-covered persons. The canonical analytical cohort contained 5,779,482 DRG events in 4,027,921 linked patients. We constructed a Chilean Orphanet-ICD-10 homologation and audited it through an agentic human-in-the-loop pipeline, yielding a conservative RD operational catalogue (434 final ICD-10 codes in the KEEP + MAP_TO_SPECIFIC_ORPHA scenario). RD-coded DRG events were labeled as observed inpatient odysseys when at least one prior DRG event existed for the same patient. We quantified prior events, DRG-observed inpatient trajectory time, nonspecific prior diagnoses, DRG weight, and bridge-code associations. Bridge-code enrichment was estimated using patient-level Fisher exact tests with Benjamini-Hochberg false-discovery correction; event-level estimates were retained as sensitivity analyses. Results: The audited conservative catalogue identified 55,284 primary-diagnosis RD-coded DRG events in 45,784 patients and 374,866 RD-coded events in any diagnostic field. We characterized 63,685 observed inpatient odyssey cases in 25,648 unique patients across 371 audited RD ICD-10 codes. Median DRG-observed inpatient trajectory time to RD-coded diagnosis was 241 days, and mean prior events per odyssey was 8.1. Bridge-code analysis identified 616 associations with support ≥ 10 patients and 390 with q < 0.05; 350 significant associations were no-same-code administrative trajectory signals.
These signals varied in interpretation, including clinically plausible precursors, diagnostic refinement, and care-process bridges. The Odyssey Index reordered conditions relative to raw prior-event counts, separating high-volume entities from stronger trajectory signatures. Conclusions: To our knowledge, we provide the first nationwide audited and reproducible characterization of inpatient RD diagnostic odysseys in Latin America using administrative hospital data. The framework supports trajectory surveillance, registry design, quality-control analyses, and prioritization of candidate signals for prospective clinical validation under Chile's Law 21,743. Bridge-code associations should be interpreted as statistically enriched administrative signals, not as validated causal or clinical pathways. Graphical abstract: Updated canonical FONASA DRG/GRD-IR 2019-2024 cohort, audited RD catalogue, odyssey cohort, and bridge-code signal summary.
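The abstract above describes patient-level Fisher exact tests followed by Benjamini-Hochberg correction. The sketch below shows one way such an enrichment screen could be computed. It is not the authors' pipeline: the one-sided (enrichment-only) alternative, the 2x2 table layout, and the function names are assumptions made for illustration.

```python
from math import comb


def fisher_enrichment_p(a, b, c, d):
    """One-sided Fisher exact p-value for enrichment in a 2x2 table.

    Assumed (hypothetical) layout, counting patients:
        a = bridge code and RD code     b = bridge code, no RD code
        c = no bridge code, RD code     d = neither
    Returns P(X >= a) under the hypergeometric null.
    """
    n_total = a + b + c + d
    with_bridge = a + b
    with_rd = a + c
    upper = min(with_bridge, with_rd)
    tail = sum(
        comb(with_rd, k) * comb(n_total - with_rd, with_bridge - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, with_bridge)


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    # Toy 2x2 tables, invented purely for illustration.
    tables = [(30, 70, 200, 9700), (12, 88, 218, 9682), (3, 97, 227, 9673)]
    p_values = [fisher_enrichment_p(*t) for t in tables]
    for table, q in zip(tables, benjamini_hochberg(p_values)):
        print(table, "q =", q)
```

An association would then be flagged when its q-value falls below 0.05 and its patient support meets the minimum stated in the abstract (≥ 10 patients).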

2
Aging Signals on Chest Radiographs: Association of Chest Radiograph-Derived Age Acceleration With Future Lung Cancer Incidence

Mitsuyama, Y.; Walston, S. L.; Takita, H.; Saito, K.; Ueda, D.

2026-03-31 radiology and imaging 10.64898/2026.03.30.26349022 medRxiv
Top 0.1%
10.7%

Purpose: To evaluate whether chest radiograph-derived age acceleration is associated with incident lung cancer and whether it improves discrimination beyond established lung cancer risk factors. Materials and Methods: This retrospective analysis used prospectively collected data from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial. Baseline digitized chest radiographs from the initial screening year were analyzed using a previously validated deep learning model that estimates chest radiograph-derived age (Xp-age). Age acceleration (AgeAccel) was defined as the residual of Xp-age after calibration to chronological age using a regression model from the development dataset. A 1-year landmark design excluded participants diagnosed with lung cancer or censored within 1 year of baseline. Associations with incident lung cancer were assessed using multivariable Cox proportional-hazards models adjusted for prespecified demographic and clinical predictors, including smoking variables used in the PLCOm2012 risk prediction model. Discrimination was evaluated using the concordance index and 6-year time-dependent area under the receiver-operating-characteristic curve. Results: The analytic cohort included 23,213 participants (mean age, 62.5 years); 790 developed incident lung cancer after the landmark (mean follow-up, 16.7 years). Higher AgeAccel was associated with increased lung cancer incidence (hazard ratio, 1.10 per 1-SD increase; 95% confidence interval: 1.03-1.17); however, addition of AgeAccel to an established risk factor model resulted in minimal change in discrimination (C-index, 0.840 vs. 0.839; time-dependent AUC at 6 years, 0.852 vs. 0.852). Attribution maps emphasized the aortic arch/mediastinal region with similar spatial patterns across smoking and lung cancer strata. Conclusion: Chest radiograph-derived age acceleration was independently associated with future lung cancer incidence.

3
Classifying and Differentiating Individuals with Respiratory Syncytial Virus, Influenza, and COVID-19 Cases in OpenSAFELY Between 2016 and 2024

Prestige, E.; Warren-Gash, C.; Quint, J. K.; Evans, D.; Costello, R. E.; Mehrkar, A.; Bacon, S.; Goldacre, B.; Barley-McMullen, S.; Yameen, F.; Shah, P.; Natt, M.; Alder, Y.; Hulme, W. J.; Parker, E. P. K.; Eggo, R. M.

2026-04-18 infectious diseases 10.64898/2026.04.09.26350495 medRxiv
Top 0.1%
8.2%

Electronic health records (EHRs) are a rich source of data which can be used to analyse health outcomes using computable phenotypes. With the approval of NHS England we used the OpenSAFELY secure analytics platform to design and assess phenotypes to classify three key respiratory viruses - respiratory syncytial virus (RSV), influenza, and COVID-19 - in English coded health data between September 2016 and August 2024. We compared specific and sensitive phenotypes to one another and to publicly available surveillance data. Cases from both phenotypes showed similar seasonal patterns to surveillance data. For mild cases, sensitive phenotypes carried a higher risk of misclassification than specific phenotypes. For severe cases, the risk of misclassification was higher in infants than in older adults, irrespective of the phenotype used. The phenotypes presented here offer a solution to classifying respiratory viruses from coded health records in the absence of testing information.

4
Agentic Authoring of OMOP Concept Sets from Natural Language

Chen, H.; He, X.; Dai, H.; Huang, Y.; Liu, M.; Bian, J.

2026-06-03 health informatics 10.64898/2026.06.02.26354704 medRxiv
Top 0.1%
7.3%

Authoring OMOP concept sets from free-text descriptions remains a major bottleneck in scalable computable phenotyping for observational research. Existing tools support parts of this workflow but are designed primarily for interactive expert use rather than autonomous large language model (LLM) agents. We present an agentic framework that automatically generates OMOP concept sets by combining vocabulary tools, ontology extensions (RxClass, LOINC, and Disease Ontology), and procedural guidance. In ablation studies, the best configuration achieved Recall@100 of 0.965 and AP@100 of 0.875 on the development set. Cohort-level validation against OMOP-mapped EHR data yielded precision of 0.970, recall of 0.998, and a Jaccard index of 0.968. On an independent silver-standard benchmark of 457 concept-vocabulary pairs from 15 AD/ADRD target trial emulation studies, Recall@100 reached 0.835 and AP@100 reached 0.786. Task-specific tools outperformed unrestricted SQL access and PHOEBE 2.0, while progressive guidance performed best.
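The abstract above reports Recall@100 for ranked concept candidates and a Jaccard index for cohort overlap. The sketch below illustrates the standard definitions of both metrics; the exact definitions used by the authors are not given in the abstract, and the concept and patient identifiers here are made up.

```python
def recall_at_k(ranked_concepts, reference_concepts, k):
    """Fraction of reference concepts found among the top-k ranked candidates."""
    reference = set(reference_concepts)
    if not reference:
        return 0.0
    retrieved = set(ranked_concepts[:k])
    return len(retrieved & reference) / len(reference)


def jaccard_index(cohort_a, cohort_b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(cohort_a), set(cohort_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical concept IDs, ranked by an agent, versus an expert-built set.
    ranked = [101, 102, 103, 104, 105]
    reference = {102, 104, 999}
    print("Recall@3:", recall_at_k(ranked, reference, k=3))

    # Hypothetical patient IDs selected by generated vs. reference concept sets.
    generated_cohort = {1, 2, 3}
    reference_cohort = {2, 3, 4}
    print("Jaccard:", jaccard_index(generated_cohort, reference_cohort))
```

Recall@k rewards finding reference concepts anywhere in the top k, while the Jaccard index penalises both missed and spurious patients. That is why a cohort can show near-perfect recall and still have a Jaccard index below 1.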

5
AI-Assisted Pneumonia Detection, Localisation and Report Generation from Chest X-rays

Boiardi, F. E.; Lain, A. D.; Posma, J. M.

2026-03-23 radiology and imaging 10.64898/2026.03.20.26348879 medRxiv
Top 0.1%
7.3%

Pneumonia detection in chest X-rays (CXRs) is complicated by high inter-observer variability and overlapping radiographic patterns. While deep learning (DL) solutions show promise, limitations in generalisability and explainability hinder clinical adoption. We address these challenges by introducing a holistic DL-based computer-aided diagnosis (CAD) pipeline for pneumonia detection, localisation, and structured report generation from CXRs. We curated the largest composite of publicly available CXRs to date (N=922,634), of which [Formula] were used for training. MIMIC-CXR radiology reports were relabelled using a local large language model (LLM), positing that LLM-derived pneumonia labels would yield higher diagnostic sensitivity than the provided rule-based natural language processing (rNLP) labels. DenseNet-121 classifiers were trained on four configurations: MIMIC-CXR (rNLP), MIMIC-CXR (LLM), and each supplemented with VinDr-CXR data. Gradient-weighted Class Activation Mapping (Grad-CAM) provided visual explainability and lung zone-based localisation. LLM-driven relabelling significantly improved human-label agreement (96.5% vs 72.5%, P=1.66 × 10⁻¹¹). The best-performing model (MIMIC-CXR (LLM) + VinDr-CXR) achieved 82.08% sensitivity and 81.97% precision, surpassing both radiologist sensitivity ranges (64-77.7%) and CheXNet's pneumonia F1-score (43.5%). Grad-CAM localisation attained a moderate F1-score of 52.9% (sensitivity=65.7%, precision=44.3%), confirming focus alignment with pathological lung regions while highlighting areas for refinement. These findings demonstrate that LLM-driven label curation, combined with DL, can exceed conventional rNLP and radiologist performance, advancing high-quality data integration in predictive medical imaging. Clinically, our pipeline offers rapid triage, automated report drafting, and real-time pneumonia surveillance, tools that can streamline radiology workflows and mitigate diagnostic errors.

6
Cadence: A Benchmark Evaluation of the Narrative Velocity Framework for Next Clinical Event Prediction in MIMIC-IV

Rouhollahi, A.; Nezami, F. R.

2026-05-11 bioinformatics 10.64898/2026.05.06.722409 medRxiv
Top 0.1%
6.9%

Objective: How structured clinical features and cluster-semantic embeddings interact under self-distillation in EHR prediction models is unknown. Existing approaches treat these sources separately (gradient-boosted trees exploit tabular features while sequence models process text), and their interaction under self-distillation regularisation remains uncharacterised. We introduce the Narrative Velocity (NV) framework and evaluate this interaction in a 7-model benchmark. Materials and Methods: Cadence is a ~5.86M-parameter residual multilayer perceptron (MLP) combining structured EHR features with frozen PubMedBERT embeddings of cluster-label strings under born-again self-distillation from a prior Cadence checkpoint (seed-42 teacher; [1]). Cadence is benchmarked against six comparators on MIMIC-IV v3.1 with dual-sex TRIPOD+AI reporting (5 student seeds for Cadence; 2-3 seeds for baselines). Results: At full-cohort scale, Cadence achieves 38.04 ± 0.04% male and 35.66 ± 0.04% female top-1 accuracy, exceeding the strongest non-neural baseline (XGBoost-2420, trained on the identical 2,420-dimensional input) by +1.35 pp male and +0.82 pp female (paired t-test on shared seeds 42-44: t(2) = 69.06, p = 2.10 × 10⁻⁴ male; t(2) = 25.32, p = 1.56 × 10⁻³ female). On time-to-next-event regression Cadence lowers MAE by 7.68 d male and 7.30 d female versus XGBoost-2420; FT-Transformer attains the lowest absolute MAE at full scale (27.58 d male, 36.63 d female), revealing a classification-regression trade-off across model families. A controlled 2 × 2 random-vector ablation isolates the self-distillation-embedding interaction at +0.49 pp top-1 (95% CI [0.35, 0.64] pp; bootstrap, n = 10,000 resamples; 3-teacher-seed mean +0.513 ± 0.010 pp) under a matched-dimensionality null. A 3-teacher-seed validation (multi_teacher_02) confirms the interaction is robust to teacher-seed identity (per-seed values +0.525, +0.509, +0.507 pp; mean +0.513 ± 0.010 pp).
Cadence achieves the best Brier score among evaluated models (0.774 male / 0.798 female) but its raw probabilities are systematically miscalibrated (ECE 0.077 vs. XGBoost-884's 0.010); after a single scalar temperature scaling step (T* ≈ 0.81), ECE drops to ≈0.028 while Brier remains best. On a small (n = 1,120 patients, 39,120 events) external OCR-extracted BWH cohort, Cadence ranked 3rd of 7 models with three confounded sources of error (institutional shift, OCR noise, centroid mapping); we therefore report this as a generalisation probe rather than a definitive external validation. At the longer h30 evaluation horizon Cadence's MAE advantage reverses (47.35 d versus XGBoost 45.06 d), reflecting the absence of a matched-horizon self-distillation teacher. Discussion: The 2 × 2 random-vector ablation confirms that the self-distillation gain on PubMedBERT embeddings (+0.78 pp) exceeds that on matched-dimensionality random vectors (+0.29 pp) by +0.49 pp, isolating the interaction to semantic content rather than feature dimensionality. The factorial decomposition (+0.49-0.51 pp interaction) and the sequential pipeline-level decomposition (Supplementary Table S3) are complementary triangulations under different reference frames and are not directly additive. Conclusion: This 7-model benchmark establishes a dual-sex, dual-metric, cross-institutional reference for next clinical event prediction under the TRIPOD+AI reporting framework. These results characterise discrimination and calibration on a single retrospective cohort; prospective evaluation, decision-curve analysis, and harm-benefit assessment are required before clinical deployment.
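The abstract above mentions a single scalar temperature scaling step and reports expected calibration error (ECE) before and after it. The sketch below shows the general technique only. The equal-width 10-bin ECE, the top-1 confidence convention, and the toy logits are assumptions, and nothing here reproduces the Cadence model or its numbers.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature < 1 sharpens, > 1 flattens."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins over top-1 confidence (an assumed convention)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, is_correct))
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(o for _, o in members) / len(members)
            ece += len(members) / n * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # invented values
    print("T = 1.00:", softmax_with_temperature(toy_logits, 1.0))
    print("T = 0.81:", softmax_with_temperature(toy_logits, 0.81))
```

In practice the scalar temperature is usually fitted on a held-out validation split by minimising negative log-likelihood, then applied unchanged at test time. Because dividing logits by a positive scalar never changes their ranking, top-1 accuracy is unaffected.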

7
First-Trimester Multi-modal cfDNA Analysis for Prediction of Preterm and Term Preeclampsia

Ertl, R.; Syngelaki, A.; Frank, O.; Lueftinger, L.; Lukacova, E.; Lumby, C.; Stuetz, A.; Beisken, S.; Posch, A. E.; Nicolaides, K. H.

2026-03-13 obstetrics and gynecology 10.64898/2026.03.12.26348234 medRxiv
Top 0.1%
6.5%

Background: Preeclampsia, which is a leading cause of maternal and perinatal mortality and morbidity, represents a biologically heterogeneous syndrome. First-trimester screening with the Fetal Medicine Foundation competing-risks model enables prevention of preterm preeclampsia through aspirin prophylaxis but depends on Doppler velocimetry and biochemical measurements that limit scalability and offer limited discrimination for term disease. A unified, molecular first-trimester test capable of stratifying risk across the full clinical spectrum of preeclampsia has not been established. Objective: To determine whether multi-modal, tissue-resolved analysis of first-trimester circulating cell-free DNA (cfDNA), obtained during routine non-invasive prenatal testing (NIPT), enables early prediction of both preterm and term preeclampsia. Study Design: This nested case-control study included 125 singleton pregnancies sampled at 11-14 weeks' gestation after quality control (48 controls, 30 preterm preeclampsia, 47 term preeclampsia). For 80 pregnancies, matched placental villi and maternal buffy coat samples were available to derive tissue reference profiles. Plasma cfDNA underwent multi-modal sequencing using Oxford Nanopore Technologies, enabling tissue-resolved analysis of fragmentomic and epigenetic signatures. Separate ensemble machine-learning classifiers were developed for preterm (<37 weeks) and term (≥37 weeks) preeclampsia using stratified 10-fold cross-validation. Model discrimination was evaluated using area under the receiver operating characteristic curve (AUROC), sensitivity at predefined specificity thresholds, and comparison with the FMF first-trimester risk score. A population-level simulation of 100,000 pregnancies, applying incidence point estimates of 2.5% for preterm and 7.5% for term PE, was used to derive predictive values and likelihood ratios.
Results: The multi-modal cfDNA classifier achieved an AUROC (95% CI) of 0.85 (0.77-0.91) for preterm preeclampsia and 0.84 (0.76-0.91) for term preeclampsia. The FMF score yielded an AUROC of 0.80 (0.70-0.89) for preterm and 0.53 (0.43-0.63) for term PE. At 80% specificity, cfDNA sensitivity was 70.5% for preterm and 72.1% for term preeclampsia, demonstrating improved discrimination for term disease compared with FMF screening. In simulated population-level analysis, positive likelihood ratios were 4.25 (preterm) and 3.83 (term), with negative likelihood ratios of 0.21 and 0.34, respectively, supporting meaningful post-test risk stratification and strong rule-out performance. Conclusion: First-trimester multi-modal, tissue-resolved cfDNA analysis enables early risk stratification across the full clinical spectrum of preeclampsia from a single routine blood sample. Compared with FMF screening, this approach can potentially improve discrimination for term preeclampsia while providing incremental improvement for preterm disease. The potential for integration into existing NIPT workflows offers a scalable pathway toward unified precision prevention, supporting timely aspirin prophylaxis for preterm preeclampsia and risk-adapted surveillance strategies for term disease.
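The abstract above reports likelihood ratios alongside assumed incidences of 2.5% (preterm) and 7.5% (term). The sketch below shows how a likelihood ratio converts a pre-test probability into a post-test probability via odds. It is an illustration of the textbook relationship, not the authors' simulation, and the resulting probabilities are not figures reported in the abstract.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Apply a likelihood ratio to a pre-test probability via odds."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


if __name__ == "__main__":
    # Incidences and likelihood ratios quoted in the abstract above.
    scenarios = {
        "preterm PE": (0.025, 4.25, 0.21),
        "term PE": (0.075, 3.83, 0.34),
    }
    for name, (incidence, lr_pos, lr_neg) in scenarios.items():
        after_positive = post_test_probability(incidence, lr_pos)
        after_negative = post_test_probability(incidence, lr_neg)
        print(f"{name}: positive test -> {after_positive:.3f}, "
              f"negative test -> {after_negative:.3f}")
```

The conversion makes the role of prevalence explicit: the same likelihood ratio yields a much smaller absolute post-test probability when the condition is rare.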

8
Learning Effects from A GenAI-based Clinical Decision Support System in Primary Healthcare

Mateen, B.; Williams, G.; Korom, R.; Mwaniki, P.; Emmanual-Fabula, M.; Agweyu, A.

2026-05-15 primary care research 10.64898/2026.05.11.26352964 medRxiv
Top 0.1%
6.4%

To characterise the potential learning effects from a GenAI-based clinical decision support tool (CDST), we examined clinician behaviour within a cluster-randomised trial. The tool, AI Consult, parsed clinician notes written (in real-time) to document patient encounters and would raise green, yellow, or red flags to indicate no, potential, or critical risks of harm (respectively) in decisions the clinician made. Over several months, clinicians with access to the AI Consult tool produced fewer red (Intervention: 14% reduction, p = 0.032 vs. Control: 6% increase, p = 0.383) and yellow flags (Intervention: 6.8% reduction, p = 0.005 vs. Control: 3% increase, p = 0.231), whereas those without access to the tool showed no such effect. If this type of learning effect is a consistent emergent property across CDSTs, there might be an opportunity to reimagine their purpose: from addressing gaps in care quality to instead being a health system-strengthening investment.

9
BREATHE: A realist evaluation protocol to understand how smoking cessation services support pregnant women in areas of social deprivation

Carlisle, N.; Zhang, M.; Simpson, N.; Stacey, T.

2026-06-10 obstetrics and gynecology 10.64898/2026.06.04.26354590 medRxiv
Top 0.1%
6.4%

Background: Tobacco smoking during pregnancy increases the risk of preterm birth, small for gestational age (SGA), stillbirth, and longer-term adverse health outcomes. Globally, reducing smoking in pregnancy is a key public health priority, yet the organisation, accessibility, and effectiveness of cessation support vary substantially between countries and healthcare systems. Differences in policy implementation, resource allocation, and integration of cessation services into antenatal care influence uptake and success rates across diverse settings. In England, pregnant women are entitled to free smoking cessation support; however, service delivery varies across regions, with mixed efficacy. While tobacco smoking is more prevalent in deprived communities, there is limited understanding of how, why, for whom, and under what circumstances these services are most effective, particularly in areas of social deprivation, such as the North East and Yorkshire. Objective: To conduct a realist evaluation to understand how smoking cessation services support pregnant women in areas of social deprivation to stop smoking and reduce adverse perinatal outcomes. Methods: This multi-site realist evaluation will be conducted across three NHS maternity services in West Yorkshire, England. The study comprises four iterative stages: (1) development of initial programme theories through realist-informed literature scoping and stakeholder consultation; (2) case study data collection including qualitative interviews with pregnant women (approximately 15-30) and staff (approximately 15-30); (3) analysis of routine anonymised maternity and neonatal electronic data collected over a one-year period; and (4) realist analysis to refine context-mechanism-outcome (CMO) configurations. Qualitative data will be analysed using realist logic supported by NVivo software.
Quantitative data will be analysed using descriptive and inferential statistics to explore associations between smoking cessation engagement and perinatal outcomes. Ethics and dissemination: Ethical approval was obtained through the UK Health Research Authority and a Research Ethics Committee prior to study commencement (IRAS 364173; REC reference number 26/SC/0020). Findings will inform recommendations to improve smoking cessation support for pregnant women in deprived areas. Results will be disseminated through peer-reviewed publications, conference presentations, and stakeholder engagement.

10
Pneumonia Detection in Paediatric Chest X-Rays using Ensembled Large Language Models

Tan, J.; Tang, P. H.

2026-04-12 radiology and imaging 10.64898/2026.04.10.26347909 medRxiv
Top 0.1%
6.4%

Background: Paediatric pneumonia is a major cause of childhood morbidity and mortality. Chest X-rays (CXR) are central to diagnosis, but shortages of specialist radiologists can delay reporting. Multimodal large language models (MLLMs) may assist clinical workflows by analysing images and communicating findings, although their diagnostic performance remains below state-of-the-art classifiers. Objective: To evaluate whether ensemble strategies improve MLLM diagnostic performance for paediatric radiological pneumonia detection on CXRs. Methods: In this retrospective study, paediatric CXRs from two datasets (balanced and real-world) at KK Women's and Children's Hospital were analysed. Images were independently reviewed by two board-certified radiologists, with pneumonia severity assigned to three classes using a predefined consensus algorithm. Fifteen MedGemma-4B-it agents classified each CXR into five likelihood categories, which were mapped to the three severity classes for evaluation. Majority voting, soft voting and GPTOSS-20B aggregation were compared with baseline average agent performance. The primary outcome was One-vs-Rest (OvR) AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's κ and One-vs-One (OvO) AUROC. Results: The balanced dataset contained 900 CXRs and the real-world dataset 1300 CXRs. Soft voting significantly improved OvR-AUROC compared with baseline in both datasets (Balanced: 0.829 > 0.764; 95% CI = 0.752-0.779; P = 0.0002. Real-world: 0.728 > 0.655; 95% CI = 0.638-0.679; P = 0.0003). Soft voting also improved accuracy, Cohen's κ, and OvO-AUROC in both datasets, and F1-score in the balanced dataset. Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs, having potential for integration into emergency departments.
Our system's high specificity supports triage by flagging high-risk radiological pneumonia cases. Clinical Impact: Paediatric CXRs often face reporting delays exceeding 24 hours due to radiologist shortages. Our proposed MLLM ensemble framework achieves better than average MLLM diagnostic discrimination for radiological pneumonia without requiring cloud-based systems. Soft-voting aggregation enhances diagnostic discriminatory effectiveness for paediatric pneumonia severity, while preserving explainable outputs. Our system acts as a decision support tool that identifies higher-risk pneumonia cases for urgent review, supporting safer triage.
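The abstract above compares majority voting with soft voting across fifteen agents. The sketch below contrasts the two aggregation rules on a toy example. How the study derives per-class scores from each agent's likelihood category is not stated in the abstract, so the probability vectors here are purely hypothetical.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def majority_vote(labels):
    """Hard voting: the most frequent predicted class label."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probability_vectors):
    """Soft voting: the mean of per-agent class-probability vectors."""
    n_agents = len(probability_vectors)
    n_classes = len(probability_vectors[0])
    return [
        sum(vector[c] for vector in probability_vectors) / n_agents
        for c in range(n_classes)
    ]


if __name__ == "__main__":
    # Three hypothetical agents scoring three severity classes.
    agent_probabilities = [
        [0.5, 0.4, 0.1],  # weakly prefers class 0
        [0.5, 0.4, 0.1],  # weakly prefers class 0
        [0.0, 0.9, 0.1],  # strongly prefers class 1
    ]
    hard_labels = [argmax(p) for p in agent_probabilities]
    print("majority vote:", majority_vote(hard_labels))
    print("soft vote:", argmax(soft_vote(agent_probabilities)))
```

The two rules disagree in this toy case because soft voting keeps each agent's confidence, while majority voting discards it. The averaged vector also gives a continuous score per class, which is what an AUROC calculation needs.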

11
System-specific multimorbidity derived from prescribing data predicts colorectal cancer outcomes: a Scottish data-linkage study.

Barnett, K. N.; Williams, L.; Weller, D.; Mercer, S. W.; Guthrie, B.; Ward, H.; Brewster, D. H.; Hubbard, G.; Campbell, C.

2026-06-02 primary care research 10.64898/2026.05.30.26354508 medRxiv
Top 0.1%
6.3%

Multimorbidity, the co-existence of two or more long-term conditions, is up to three times more prevalent among people with cancer than in the general population and is associated with poorer survival, particularly for cancers with a more favourable prognosis such as colorectal cancer. In Scotland, multimorbidity is the norm among older adults, emerges earlier in socioeconomically deprived populations, and may contribute to comparatively low cancer survival rates. Despite this, the influence of multimorbidity on the colorectal cancer pathway remains poorly understood. We conducted a Scottish data-linkage study of adults diagnosed with colorectal cancer between 2010 and 2014, linking the Scottish Cancer Registry to national prescribing, hospital admissions, death registration, and bowel screening datasets. Prescribing data were used to derive overall and system-specific comorbidity measures as a proxy for multimorbidity and active disease burden. Associations with stage at diagnosis, treatment, survival, and screening uptake were examined using logistic regression and Cox proportional hazards models adjusted for demographic and clinical covariates. Among 19,043 patients, 87% had at least one prescribing-based comorbidity, most commonly cardiovascular, nervous system, and gastrointestinal conditions. Overall comorbidity burden was not associated with stage at diagnosis, although laxative-related prescribing was associated with later-stage disease. Increasing comorbidity burden reduced the likelihood of receiving any treatment and surgery, while associations varied across system-specific comorbidities. Higher comorbidity burden was also associated with increased all-cause and colorectal cancer-specific mortality, particularly among patients with respiratory, nervous system, and haematological/nutritional conditions. Screening uptake was not associated with overall comorbidity burden but did differ by system-specific comorbidity. 
Prescribing-based multimorbidity was highly prevalent and strongly associated with treatment patterns and mortality among patients with colorectal cancer. System-specific multimorbidity measures provided greater discrimination than overall morbidity counts, highlighting the importance of considering distinct multimorbidity profiles when assessing cancer pathways and designing targeted interventions for optimising treatment and survival. Keywords (primary health care, general practice, multimorbidity, comorbidity, colorectal cancer, early diagnosis, cancer treatment, survival)

12
High-Throughput Observational Evidence Generation Using Linked Electronic Health Record and Claims Data

Gombar, S.; Shah, N.; Sanghavi, N.; Coyle, J.; Mukerji, A.; Chappelka, M.

2026-04-07 health informatics 10.64898/2026.04.07.26350300 medRxiv
Top 0.1%
6.3%

Background: The observational literature on comparative effectiveness is expanding rapidly but remains difficult to synthesize. Discordant findings often stem from structural differences in cohort definitions, inclusion criteria, and follow-up windows, leaving stakeholders without a cohesive evidence base. Furthermore, studies typically focus on a narrow subset of outcomes, neglecting the broader needs of diverse healthcare stakeholders [1-4]. Methods: We developed a high-throughput evidence generation workflow using linked EHR and administrative claims data. The cornerstone is a prespecified measurement architecture applied uniformly across clinical scenarios: six post-index windows (acute to two-year follow-up); 28 Elixhauser comorbidities; 14 healthcare resource utilization (HCRU) categories; 29 laboratory measures with 52 binary thresholds; and 42 adverse event categories. We generated unadjusted treatment comparisons across ~1,038 outcomes per scenario, including effect-measure modification (EMM) assessments across 130 baseline features. Results: Across 40 clinical domains, the workflow produced approximately 32,982,552 outcome evaluations. An evaluation included a treatment comparison outcome population effect estimate with uncertainty bounds and supporting diagnostics. Approximately 5,000 narrative summaries underwent structured clinical and statistical quality control before dissemination. Conclusions: Standardized, high-throughput workflows can shift evidence generation away from fragmented studies toward comprehensive evidence packages. This shared evidence base supports precision medicine by making treatment effect heterogeneity visible across clinically meaningful subpopulations, reducing the need for redundant, stakeholder-specific studies.

13
Harmonising UK primary care prescription records for research: A case study in the UK Biobank

Ytsma, C. R.; Torralbo, A.; Fitzpatrick, N. K.; Pietzner, M.; Louloudis, I.; Nguyen, D.; Ansarey, S.; Denaxas, S.

2026-04-22 health informatics 10.64898/2026.04.21.26351274 medRxiv
Top 0.1%
6.3%

Objective: The aim of this study was to develop and validate an automated, scalable framework to harmonise fragmented UK primary care prescription records into a research-ready dataset by mapping four diverse medical ontologies to a unified, historically comprehensive reference standard. Materials and Methods: We used raw prescription records for consented participants in the UK Biobank, in which participants are uniquely characterized by multiple data modalities. Primary care data were preprocessed by selecting one drug code if multiple were recorded, cleaning codes to match reference presentations, expanding code granularity based on drug descriptions, and updating outdated codes to a single reference version. Harmonisation entailed mapping British National Formulary (BNF) and Read2 codes to dm+d, the universal NHS standard vocabulary for uniquely identifying and prescribing medicines. Harmonised dm+d records were then homogenised to a single concept granularity, the Virtual Medicinal Product (VMP). We validated our methods by creating medication profiles mapping contemporary drug prescribing patterns in 312 physical and mental health conditions. Results: We preprocessed 57,659,844 records (100%) from 221,868 participants (100%). Of those, 48,950 records were dropped due to lack of drug code. 7,357,572 records (13%) used multiple ontologies. Most (76%) records were encoded in BNF and most had the code granularity expanded via the drug description (N=28,034,282; 49%). 41,244,315 records (72%) were harmonised to dm+d and 99.98% of these were converted to VMP as a homogeneous dataset. Across 312 diseases, we identified 23,352 disease-drug associations with 237 medications (represented as BNF subparagraphs) that survived statistical correction, most of which resembled drug-indication pairs.
ConclusionOur methodology converts highly fragmented and raw prescription records with inconsistent data quality into a streamlined, enriched dataset at a single reference, version, and granularity of information. Harmonised prescription records can be easily utilised by researchers to perform large-scale analyses in research.

14
Unmeasured but Not Unbiased: The Missingness Demographic Leakage Audit (MDLA) for Calibration-Aware Fairness Evaluation in Critical Care Mortality Prediction

Patel, K.; Beedala, P.

2026-05-03 health informatics 10.64898/2026.05.01.26352193 medRxiv
Top 0.1%
6.3%

Objective: Clinical prediction models trained on electronic health records are routinely evaluated for fairness on observed feature values, but the informativeness of which measurements are absent remains unaudited. We developed the Missingness Demographic Leakage Audit (MDLA), a reproducible four-step informatics framework that tests whether patterns of clinical measurement absence function as latent demographic proxies -- constituting a bias pathway invisible to standard fairness audits. Materials and Methods: We applied MDLA across development (MIMIC-IV v2.2; n=50,827; mortality 10.2%) and external validation (eICU-CRD v2.0; n=137,773; mortality 9.5%) cohorts following TRIPOD+AI standards. XGBoost, random forest, and logistic regression were trained on 43 clinical features and 44 binary missingness indicators. MDLA quantified demographic predictability from missingness alone, tested feature-level associations with Bonferroni correction, and verified model reliance via ablation. A calibration-aware fairness audit evaluated five criteria across four demographic axes; six post-hoc recalibration strategies were compared on a fairness-utility Pareto frontier. Results: Missingness indicators alone predicted racial group membership above chance (AUROC=0.543; 95% CI, 0.540-0.546), with 18 of 43 features showing Bonferroni-significant race-missingness associations (all Cramér's V<0.10). Ablation confirmed model reliance: adding missingness indicators increased racial AUROC disparity by 10.7% (0.063 to 0.069) without improving global performance. XGBoost achieved AUROC=0.910 internally (AUROC=0.799 on external validation). Global Platt recalibration reduced overall calibration error by 94% and maximum racial calibration error by 51%, with zero AUROC loss and successful parameter transfer to external validation without retraining. Conclusion: MDLA provides a structured, reproducible protocol for detecting missingness-encoded demographic signals prior to model deployment. Applied across 188,600 ICU patient-stays from two institutionally diverse databases, it identified a statistically confirmed but subtle bias pathway undetectable by standard fairness audits. Missingness-aware auditing and calibration-aware evaluation should be integrated into clinical AI validation pipelines.
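
As a rough illustration of the feature-level step described in this abstract, the sketch below derives a binary missingness indicator from records with absent values and measures its association with a demographic group using Cramér's V. The record layout, the field names (`lactate`, `group`), the toy counts, and both helper functions are hypothetical; the abstract does not describe the authors' implementation, and the Bonferroni-corrected significance test is not shown.

```python
import math
from collections import Counter

def missingness_indicator(records, feature):
    """Return 1 where the feature is absent (None or not present), else 0."""
    return [1 if r.get(feature) is None else 0 for r in records]

def cramers_v(xs, ys):
    """Cramér's V for two categorical sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row = Counter(xs)
    col = Counter(ys)
    if len(row) < 2 or len(col) < 2:
        return 0.0  # no variation, so no measurable association
    chi2 = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(row), len(col)) - 1)))

# Toy data, invented for illustration: lactate is measured less often in group "B".
records = (
    [{"lactate": 1.2, "group": "A"}] * 40
    + [{"lactate": None, "group": "A"}] * 10
    + [{"lactate": 1.5, "group": "B"}] * 30
    + [{"lactate": None, "group": "B"}] * 20
)
missing = missingness_indicator(records, "lactate")
groups = [r["group"] for r in records]
v = cramers_v(missing, groups)
```

A value of `v` near 0 means the pattern of absence carries little information about group membership; values near 1 mean absence nearly determines the group.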

15
Geographic variation in pregnancy associated overdose and substance use disorder mortality, 2016 to 2022

Kramer, M. R.; Peterson, E. N.; Cooper, H. L.

2026-03-17 obstetrics and gynecology 10.64898/2026.03.15.26348441 medRxiv
Top 0.1%
6.3%

Importance: Drug-related pregnancy-associated mortality is a leading contributor to the US maternal mortality crisis, yet whether it follows the persistent rural disadvantage documented for all-cause maternal mortality--or is restructured by the geographic dynamics of drug markets--has not been established. Objective: To characterize geographic variation in pregnancy-associated overdose (OD) and substance use disorder (SUD) mortality across the rural-urban continuum and by US Census region from 2016 through 2022. Design, Setting, and Participants: National population-based surveillance study using individual-level National Vital Statistics System (NVSS) mortality and natality records. Pregnancy-associated deaths (occurring during pregnancy or within one year of the end of pregnancy) were ascertained among 25,007,723 live births during 2016-2022 using the NVSS 2018 algorithm. Exposures: Rural-urban classification cross-classified by four US Census regions. Main Outcomes and Measures: Rates of pregnancy-associated OD mortality and SUD mortality per 100,000 live births. Post-COVID excess OD mortality was estimated using a Bayesian hierarchical Poisson model. Results: There were 516 OD deaths (2.06 per 100,000 live births) and 1,080 SUD deaths (4.32 per 100,000) nationally; SUD exceeded OD mortality more than two-to-one in all strata, and both outcomes were concentrated in the late postpartum period (43 days to 1 year). OD mortality converged across the rural-urban gradient during the COVID era (2020-2022)--the inverse of the persistent rural disadvantage in all-cause maternal mortality--with metropolitan areas falling below pre-pandemic trajectory expectations while non-metropolitan areas exceeded theirs. Credible excess OD mortality was identified in non-metropolitan Southern and Northeastern counties. SUD rates were non-monotonic across urbanicity, with metro-adjacent counties carrying elevated rates in all regions. Conclusions and Relevance: Drug-related pregnancy-associated mortality follows a distinct geographic logic from all-cause maternal mortality, shaped by drug supply dynamics and harm reduction geography rather than obstetric care infrastructure alone. The convergence of OD mortality across the rural-urban gradient, the dominance of SUD over acute overdose, and the concentration of deaths in the late postpartum year point to care and surveillance gaps requiring integrated obstetric and addiction treatment, extended postpartum insurance coverage, and rural harm reduction capacity.

16
Interpretable Predictive Modeling for Medical Data Using Boolean Rule-aware Regression

Eskandarian, M.; Malekpour, S. A.

2026-05-18 bioinformatics 10.64898/2026.05.14.725084 medRxiv
Top 0.1%
6.3%

Purpose: In clinical practice, accurate prediction of disease risk must be accompanied by transparent, human-understandable explanations to support diagnostic confidence, guide therapeutic decisions, and meet ethical and regulatory standards. While deep neural networks achieve high predictive performance in tasks such as cancer detection and diabetes risk stratification, their black-box nature prevents clinicians from understanding the reasoning behind predictions, severely limiting trust and safe integration into patient care. Methods: We present Regression-Based Boolean Rule (RBBR), a framework that automatically derives clinically interpretable Boolean rules directly from patient data. RBBR generates human-readable conjunctions (logical AND combinations) of up to three clinical features, transforms them into inputs for ridge regression to predict binary or multi-class disease outcomes, estimates rule importance via regularized coefficients, and selects the most parsimonious and predictive rule sets using the Bayesian Information Criterion. Results: Applied to six real-world medical datasets (lung cancer screening and staging, Wisconsin and diagnostic breast cancer, heart failure, and early-stage diabetes risk), RBBR consistently produced concise, clinically meaningful rules - e.g., gender-specific symptom combinations in diabetes, distinct histopathological subpopulations in breast cancer, and symptom-risk factor interactions in lung cancer - with strong explanatory power (R² up to 0.92) and competitive discrimination. Conclusion: By delivering logical, transparent decision rules aligned with clinical reasoning (if symptom A and B, then high risk), RBBR bridges the gap between predictive accuracy and bedside usability, enabling clinicians to validate predictions, identify high-risk patients, stratify subpopulations, and enhance shared decision-making in routine care.
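
The rule-generation step in this abstract (AND combinations of up to three features) can be sketched as a simple enumeration. This is only an illustration under stated assumptions: the function name, the feature names, and the patient values are invented; only positive literals are combined, since the abstract does not say whether negated features are included; and the ridge regression and BIC selection steps are not shown.

```python
from itertools import combinations

def conjunction_features(row, max_order=3):
    """Map a dict of binary features to every AND-combination of up to max_order features."""
    names = sorted(row)
    out = {}
    for k in range(1, max_order + 1):
        for combo in combinations(names, k):
            # A conjunction is 1 only when every feature in it is 1.
            out[" AND ".join(combo)] = int(all(row[name] for name in combo))
    return out

# Invented example values for one patient.
patient = {"polyuria": 1, "polydipsia": 1, "obesity": 0}
features = conjunction_features(patient)
```

Each key in `features` is a human-readable candidate rule, and the resulting 0/1 columns could then be passed to any regularized linear model. The number of candidates grows roughly cubically with the number of input features, which is one reason a parsimony criterion is needed downstream.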

17
Trade-offs in Cardiovascular Risk Prediction Using Race and Social Determinants of Health

Hammarlund, N.; Wang, X.; Grant, D.; Purves, D.

2026-04-04 cardiovascular medicine 10.64898/2026.04.02.26350089 medRxiv
Top 0.1%
6.2%

Importance: Health systems are increasingly adopting race-neutral cardiovascular risk prediction tools, yet no study has examined how these choices redistribute preventive treatment at the point of clinical decision-making, particularly for Black individuals who already bear a disproportionate cardiovascular burden. Objective: To evaluate how including race, substituting social determinants of health (SDoH), or excluding both reshapes cardiovascular risk classification, calibration, fairness, and clinical decisions. Design: Retrospective cohort study with repeated cross-validation and integrated decision-focused evaluation, using CARDIA study data with baseline measures from 2010 and cardiovascular outcomes through 2021. Setting: Community-based longitudinal cohort recruited across multiple U.S. cities. Participants: 3,241 Black and White adults without known cardiovascular disease at baseline. Main Outcomes and Measures: Three models predicting 10-year incident cardiovascular disease were compared on predictive performance, calibration, fairness metrics, and realized clinical utility at the ACC/AHA 7.5% preventive treatment threshold. Results: Among 3,241 participants (46% Black, mean age 50 years, 6.9% CVD incidence), overall performance was similar across models (AUC 0.762 to 0.768). Predictor choice substantially reshaped clinical decisions at the guideline threshold. The SDoH-based model improved parity metrics but produced systematic underprediction and concentrated new overtreatment among Black participants. The clinical-only model further improved parity metrics but generated new undertreatment, with four cases of untreated CVD and none avoided. No single evaluative dimension captured the full equity consequences. Conclusions and Relevance: Parity metrics improved under both race-neutral models, yet both produced clinical harms concentrated among Black participants not apparent in population-average metrics. The case for race removal has rested on conceptual grounds, but comprehensive empirical evaluation is necessary before health systems can be confident their model choices truly serve those most at risk.

18
Mother-infant linked UK electronic birth cohorts representing 17.5 million births harmonised to the OMOP common data model

Seaborne, M.; Durbaba, S.; Mendez-Villalon, A.; Giles, T.; Gonzalez-Izquierdo, A.; Hough, A.; Sanchez-Soriano, C.; Snell, H.; Cockburn, N.; Nirantharakumar, K.; Poston, L.; Reynolds, R.; Santorelli, G.; Brophy, S.

2026-03-25 public and global health 10.64898/2026.03.23.26349078 medRxiv
Top 0.1%
6.2%

We describe the harmonisation of five UK electronic birth cohorts to the Observational Medical Outcomes Partnership (OMOP) Common Data Model, creating a large-scale, standardised resource for maternal and child health research. The Mother and Infant Research Data Analysis (MIREDA) partnership developed and implemented reproducible guidelines for mapping maternal-infant relationships and identifying pregnancy episodes within routinely collected healthcare data. Cohorts from England, Scotland, and Wales were transformed despite substantial heterogeneity in data structure, coding systems, and variable definitions. The resulting harmonised resource preserves each cohort as an independent dataset while enabling federated analyses to be conducted across sites without the need to share individual-level data. Collectively, the cohorts capture over 17.5 million live births, providing sufficient scale to investigate rare exposures and outcomes, support trial emulation, and evaluate population-level policy impacts across the UK. This article details the transformation pipeline and provides reusable methods to support extension to additional cohorts and networks. The harmonised datasets enable interoperable, reproducible research and facilitate cross-national comparative studies in maternal and child health.

19
Calibration Drift Under Cross-Institutional Deployment: An External Validation Framework for ICU Mortality Prediction Across MIMIC-IV and eICU

Patel, K.; Beedala, P.

2026-05-05 health informatics 10.64898/2026.05.03.26352335 medRxiv
Top 0.1%
6.2%

Background: Machine learning models for intensive care unit (ICU) mortality prediction achieve strong internal discrimination yet rarely undergo external validation with calibration assessment -- a gap undermining clinical deployment. Calibration, the agreement between predicted probabilities and observed event rates, is prerequisite for threshold-based decisions yet remains underreported. Methods: We conducted a retrospective cohort study using MIMIC-IV (v2.2; n = 52,028 ICU stays) for model development and eICU (n = 114,060) for independent external validation. Logistic regression, random forest, and gradient boosting (XGBoost) were evaluated on first-24-hour clinical variables. Discrimination was assessed via receiver operating characteristic area (AUROC) and precision-recall area (AUPRC); calibration via slope, intercept, and expected calibration error (ECE). Post-hoc logistic recalibration was applied externally. Clinical utility was evaluated by decision curve analysis benchmarked against Acute Physiology and Chronic Health Evaluation (APACHE) scores. Subgroup analyses examined sex and race/ethnicity; SHapley Additive exPlanations (SHAP) assessed feature importance. Uncertainty was estimated via bootstrap resampling; the study adheres to TRIPOD guidelines. Results: The recalibrated XGBoost model achieved internal AUROC 0.847 (95% CI: 0.832-0.860) and external AUROC 0.819 (95% CI: 0.815-0.823). Internal calibration was near-ideal (slope 0.982; intercept 0.001), whereas external validation revealed systematic risk overestimation (intercept -0.678) attributable to prevalence-driven label shift. An intercept-only adjustment reduced ECE by 26%. The model outperformed APACHE (AUROC 0.817 vs. 0.795; p < 0.001). Conclusions: ICU mortality models exhibit transportable discrimination but clinically significant calibration drift under cross-institutional deployment. Calibration evaluation and targeted recalibration should be mandatory in any clinical machine learning validation framework.
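
To make the calibration terms in this abstract concrete, the sketch below computes an equal-width-bin expected calibration error and fits an intercept-only shift on the logit scale. It is a minimal illustration, not the authors' code: the function names, the bin count, the bisection bounds, and the toy predictions are all assumptions. The shift is found by matching the mean recalibrated probability to the observed event rate, which is the maximum-likelihood condition for a logistic model with a fixed slope of 1.

```python
import math

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean of |observed rate - mean predicted probability| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(probs)
    ece = 0.0
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            rate = sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(rate - mean_p)
    return ece

def logit(p):
    return math.log(p / (1 - p))

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def fit_intercept_shift(probs, labels, lo=-10.0, hi=10.0, iters=60):
    """Bisection for delta such that mean sigmoid(logit(p) + delta) equals the event rate."""
    target = sum(labels) / len(labels)
    for _ in range(iters):
        mid = (lo + hi) / 2
        mean_pred = sum(sigmoid(logit(p) + mid) for p in probs) / len(probs)
        if mean_pred > target:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

# Toy external cohort, invented for illustration: predicted risk 0.2, observed rate 0.1.
probs = [0.2] * 10
labels = [1] + [0] * 9
delta = fit_intercept_shift(probs, labels)
recalibrated = [sigmoid(logit(p) + delta) for p in probs]
ece_before = expected_calibration_error(probs, labels)
ece_after = expected_calibration_error(recalibrated, labels)
```

A negative `delta` indicates that the original model overestimated risk in the new cohort, the same direction as the negative external intercept reported above. Because the shift is monotone, it leaves the ranking of patients, and therefore AUROC, unchanged.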

20
The FEES Dysphagia Index: a bias-resilient continuous score that captures expert clinical judgment in 2,943 neurological inpatients

Werner, C. J.; Sanchez-Garcia, E.; Mall, B.; Meyer, T.; Pinho, J.; Schulz, J. B.; Schumann-Werner, B.

2026-04-21 neurology 10.64898/2026.04.20.26351259 medRxiv
Top 0.1%
4.8%

Multi-consistency testing during flexible endoscopic evaluation of swallowing (FEES) is clinically necessary but introduces selection bias: worst scores inflate severity because the number of consistencies tested covaries with disease severity. In this retrospective observational study of hospitalized neurological patients, we derived and validated the FEES Dysphagia Index (FDI) in two temporally independent cohorts (Cohort 1: 2013-2018, N=1,257; Cohort 2: 2021-2025, N=1,686) from a single center. FDI-S averages Penetration-Aspiration Scale (PAS) scores across tested consistencies (0-100 scale); FDI-E uses Yale Pharyngeal Residue scores; FDI-C combines both. Selection bias was quantified using sequential branching-tree inverse probability weighting (IPW). Worst PAS overestimated severity by 24%; FDI deviated by <2%. FDI-C was significantly superior to Worst PAS for hospital-acquired pneumonia (HAP; AUC 0.70 vs. 0.60, p<0.001), mortality (0.71 vs. 0.62, p=0.040), and restricted oral intake (0.90 vs. 0.74, p<0.001), and statistically equivalent to clinician-rated severity. FDI-C mapped linearly onto ordinal Functional Oral Intake Scale values (FOIS; proportional odds RCS p=0.99). With functional status and diagnosis, FDI-C reconstructed the clinician's oral intake recommendation with AUC up to 0.93. The FDI-C-mortality relationship was sigmoidal with a clinically relevant transition zone between ~50 and ~85. FDI-C is a bias-resilient, bedside-calculable score with interval-scale properties that captures expert clinical judgment, suitable as both a clinical decision support tool and a continuous research endpoint.
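
The contrast between a worst-score summary and an averaged score can be illustrated with a few lines of arithmetic. The abstract states only that FDI-S averages PAS scores across tested consistencies on a 0-100 scale; the linear rescaling from the 1-8 PAS range used below is an assumption, and the function names and patient scores are invented for illustration.

```python
def worst_pas(scores):
    """Worst (highest) PAS score across the consistencies that were tested."""
    return max(scores)

def fdi_s_sketch(scores, lo=1, hi=8):
    """Mean PAS across tested consistencies, linearly rescaled to 0-100 (assumed scaling)."""
    mean = sum(scores) / len(scores)
    return 100 * (mean - lo) / (hi - lo)

# Invented example: four consistencies tested, one of which was aspirated.
scores = [1, 1, 1, 8]
worst = worst_pas(scores)    # the single worst result sets the summary
index = fdi_s_sketch(scores)  # every tested consistency contributes
```

A maximum can only stay the same or rise as more consistencies are tested, whereas an average is not pushed in one direction by the number of tests, which is the intuition behind the selection-bias argument in the abstract.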